Abstract

In this paper, we evaluate convolutional neural network
(CNN) features using the AlexNet architecture developed by
Krizhevsky et al. and the very deep convolutional network
(VGGNet) architecture developed by Oxford University's
Visual Geometry Group. To date, most CNN researchers have
employed the last layers before output, which are extracted
from the fully connected layers. However, since the
effectiveness of a feature representation is likely to
depend on the problem, this study also evaluates the
convolutional layers adjacent to the fully connected
layers, and applies simple tuning in the form of feature
concatenation (e.g., layer 3 + layer 5 + layer 7) and
transformation with tools such as principal component
analysis. In our experiments, we carried out detection and
classification tasks using the Caltech 101 and Daimler
Pedestrian Benchmark Datasets.


1. Introduction 

Over the past few years, convolutional neural networks
(CNNs) have improved significantly in terms of network
architecture, raising recognition accuracy and reducing
processing costs. Currently, CNNs are primarily used to
understand objects and scenes in an image. In our study, we
applied a CNN to the ImageNet dataset, which contains over
1.4 million images and 1,000 object categories. Such a
large-scale dataset makes it possible to model a wide
variety of image features for object recognition. By using
a model pre-trained on ImageNet, we found that a CNN is
capable of providing significantly more effective feature
variations.

For feature extraction, Donahue et al. employed CNN
features as a feature vector by combining those features
with a support vector machine (SVM) classifier, while
other researchers have evaluated and visualized CNN
features with an eight-layer AlexNet architecture. More
recent architectures utilize deeper structures, such as the
very deep convolutional network (VGGNet) and GoogLeNet,
which were developed by Oxford University's Visual Geometry
Group and Google Inc., respectively.

According to He et al., the most important feature of a
CNN is its deep architecture. Along this line, VGGNet
contains 16 to 19 layers and GoogLeNet utilizes 22 layers.
VGGNet is frequently used in the computer vision field, not
only for training neural network models from scratch, but
also as a feature generator. A CNN's utility as a feature
generator is also important because it can function well
even if only a few learning samples are available. Training
on large-scale databases such as ImageNet can yield
recognition rates that outperform human-level
classification. However, this performance will fluctuate
depending on the amount and variance of the data.
Therefore, when a CNN is used for feature generation, it
provides better performance for some recognition problems
than others.

Donahue et al. argued that usage should be limited to
the last two layers before output, which are extracted from
the first and second fully connected layers of AlexNet.
However, we believe that more detailed evaluations should
be undertaken, since several different architectures have
recently been proposed and middle layers have not been
examined as feature descriptors. Accordingly, in this
study, we performed more detailed experiments to evaluate
two well-known CNN architectures: AlexNet and VGGNet. In
addition, we carried out simple tuning in the form of
feature concatenation (e.g., layer 3 + layer 5 + layer 7)
and transformation (e.g., principal component analysis:
PCA).

The rest of this paper is organized as follows. Section 2
reviews related work. The feature settings are described in
Section 3. The results are shown in Section 4. Finally, we
conclude the paper in Section 5.

2. Related works 

In the time since the neocognitron was first proposed by
Fukushima, neuron-based recognition has become one of the
most commonly used neural network architectures.
Figure 1. AlexNet and VGGNet architecture.


Following that study, the LeNet-5 model established a
baseline for CNNs and made them a more significant model.
Current network architectures include standard structures
such as multiple fully connected layers, while recent
challengers employ pre-training, dropout, and rectified
linear units (ReLU) as improved learning models. The most
outstanding computer vision result was obtained by AlexNet
in the 2012 ImageNet Large Scale Visual Recognition
Challenge (ILSVRC2012), which remains the leading image
recognition benchmark, with 1,000 classes [13].

AlexNet made it possible to increase the number of layers
in network architectures. For example, Krizhevsky et al.
implemented an eight-layer model that includes convolution,
pooling, and fully connected layers. More recent
variations, such as the 16- or 19-layer VGGNet and the
22-layer GoogLeNet [17] models, have even deeper
architectures. These deeper models outperform conventional
models on the ILSVRC dataset. More specifically, compared
to AlexNet (top-five error rate of 16.4% on ILSVRC2012),
the deeper GoogLeNet and VGGNet models achieved better
performance (top-five error rates on ILSVRC2014 of 6.7% for
GoogLeNet and 7.3% for VGGNet). Currently, the object
detection problem is one of the most important topics in
computer vision. The existing state-of-the-art framework,
regions with convolutional neural networks (R-CNN), was
proposed by Girshick et al. This framework consists of two
steps in which (i) object areas are extracted as object
proposals, and (ii) CNN recognition is performed. Those
authors adopted selective search as the object proposal
approach and VGGNet for the CNN architecture. However,
while they restricted the object detection and recognition
tasks to fully connected CNN features, we believe that the
features of the other layers should be more carefully
evaluated in order to determine whether they could provide
more accurate recognition and detection.


3. Feature settings and representations 

In this paper, we evaluate two deep learning feature
types. Figure 1 shows the architectures of AlexNet [13] and
VGGNet. We believe that while the evaluation itself is very
important, particular attention must also be paid to tuning
steps such as concatenation and feature transformation.
Basically, the settings of each deep learning architecture
follow its original approach.

Feature setting. We begin by extracting the middle and
deeper layers. Layers 3-7 of AlexNet and VGGNet are shown
in Figure 1. Next, in VGGNet, we extract each max-pooling
layer (layers 3-5) and the last two fully connected layers
(layers 6 and 7).

Concatenation and transformation. Next, we concatenate
neighboring layers (e.g., layer-3,4,5) or layers one step
apart (e.g., layer-3,5,7). For feature transformation, we
simply apply PCA, which is set to 1,500 dimensions in this
experiment.
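
The concatenation and PCA steps can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: the array shapes, the random activations, and the helper names concatenate_layers and pca_fit_transform are all hypothetical, and PCA is computed here with a plain NumPy SVD.

```python
import numpy as np


def concatenate_layers(features_by_layer, layers):
    """Flatten each selected layer's activations and join them per image."""
    flat = [features_by_layer[k].reshape(features_by_layer[k].shape[0], -1)
            for k in layers]
    return np.concatenate(flat, axis=1)


def pca_fit_transform(x, n_components):
    """Project the rows of x onto the top principal components (via SVD)."""
    centered = x - x.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


# Hypothetical activations for 20 images; real layers are far larger.
rng = np.random.default_rng(0)
features = {
    3: rng.normal(size=(20, 8, 4, 4)),   # stand-in for a pooled conv layer
    5: rng.normal(size=(20, 16, 2, 2)),  # stand-in for a deeper pooled layer
    7: rng.normal(size=(20, 32)),        # stand-in for a fully connected layer
}
combined = concatenate_layers(features, layers=(3, 5, 7))  # layer-3,5,7
reduced = pca_fit_transform(combined, n_components=10)
print(combined.shape, reduced.shape)
```

Note that the number of components cannot exceed the smaller of the sample count and the feature dimension, so a 1,500-dimensional projection requires at least 1,500 training vectors.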

Classifier. In the next step, we apply an SVM to the deep
learning features for object recognition. The parameters
are based on DeCAF.
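
As a rough illustration of the classifier step, the sketch below trains a linear SVM on feature vectors by subgradient descent on the regularized hinge loss. Everything here is an assumption made for illustration: the toy two-class data, the function names, and the values of reg, lr, and epochs are hypothetical and do not reproduce the DeCAF-based parameters.

```python
import numpy as np


def train_linear_svm(x, y, reg=0.01, lr=0.1, epochs=200):
    """Fit w, b by subgradient descent on the L2-regularized hinge loss.

    Labels in y must be +1 or -1.
    """
    w = np.zeros(x.shape[1])
    b = 0.0
    n = len(y)
    for _ in range(epochs):
        margins = y * (x @ w + b)
        active = margins < 1.0  # samples inside or beyond the margin
        grad_w = reg * w - x[active].T @ y[active] / n
        grad_b = -y[active].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def predict(x, w, b):
    """Return +1 or -1 for each row of x."""
    return np.where(x @ w + b >= 0, 1, -1)


# Hypothetical 6-dimensional "features" for two well-separated classes.
rng = np.random.default_rng(0)
positives = rng.normal(loc=1.0, scale=0.3, size=(50, 6))
negatives = rng.normal(loc=-1.0, scale=0.3, size=(50, 6))
x = np.vstack([positives, negatives])
y = np.concatenate([np.ones(50), -np.ones(50)])

w, b = train_linear_svm(x, y)
accuracy = float((predict(x, w, b) == y).mean())
print(accuracy)
```

In practice an established SVM library would normally replace this hand-written training loop; it is spelled out here only to keep the example self-contained.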
Figure 2. Comparison of AlexNet and VGGNet features on the Daimler Pedestrian Benchmark Dataset. CNN layers 3-7 are listed.



Figure 3. Comparison of AlexNet and VGGNet features on the Caltech 101 Dataset. CNN layers 3-7 are listed. 


4. Experiments 

In this section, we discuss our experiments conducted
using the Daimler pedestrian benchmark and Caltech 101
datasets. Figures 2 and 3 show the results of our deep CNN
feature evaluations on the Daimler and Caltech 101
datasets, respectively. The figures also show VGGNet,
AlexNet, and their features compressed with PCA
(VGGNet(PCA) and AlexNet(PCA)).

Table 1. Feature concatenation on the Daimler pedestrian benchmark
dataset. The highest rate for each architecture is marked with *.

Layer    VGGNet   AlexNet
3,4,5    97.38    97.45*
4,5,6    98.73    96.98
5,6,7    99.38*   97.04
3,5,7    95.96    97.26

Table 2. Feature concatenation on the Caltech 101 dataset. The
highest rate for each architecture is marked with *.

Layer    VGGNet   AlexNet
3,4,5    78.13    77.95*
4,5,6    85.03    77.06
5,6,7    92.00*   77.38
3,5,7    73.07    77.91

In the Daimler dataset experiment, we found that
VGGNet(PCA) layers 5 and 4 showed the best performance
rates, at 99.35% and 98.92%, respectively. We also found
that the low-dimensional feature vectors produced by the
PCA transformation achieve better rates than the original
features: VGGNet layer 5 (98.91%) and layer 4 (98.81%)
improve by 0.44 and 0.11 percentage points, respectively,
with PCA. When AlexNet is used, layers 3 and 4 show the top
rates, at 98.71% and 97.95%, respectively. On the Caltech
101 dataset, VGGNet layers 5 and 6 achieved the best
results (91.8%). However, these results show a significant
difference at layer 5 between VGGNet (91.8%) and AlexNet
(78.37%).
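
The PCA gains and the layer 5 gap quoted above follow directly from the reported rates. The short check below only re-derives them from the numbers in the text; the variable names are illustrative.

```python
# Accuracy (%) on the Daimler dataset, copied from the text above.
vggnet_original = {5: 98.91, 4: 98.81}
vggnet_pca = {5: 99.35, 4: 98.92}

# Gain from PCA in percentage points, per layer.
pca_gain = {layer: round(vggnet_pca[layer] - vggnet_original[layer], 2)
            for layer in vggnet_original}
print(pca_gain)  # {5: 0.44, 4: 0.11}

# Layer 5 gap between VGGNet and AlexNet on Caltech 101, in points.
caltech_layer5_gap = round(91.8 - 78.37, 2)
print(caltech_layer5_gap)  # 13.43
```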

From the above results, it can be seen that features
obtained from fully connected layers do not always provide
the highest performance rates in recognition and detection
tasks, and that middle-layer features are more flexible for
some tasks. We also found that fully connected layers, or
max-pooling layers located near fully connected layers,
tend to perform better in general object recognition tasks
such as Caltech 101.

The main difference between AlexNet and VGGNet is the
architecture depth. Additionally, VGGNet uses very small
3×3 convolutional kernels throughout, in contrast to the
7×7 (Conv 1), 5×5 (Conv 2), and 3×3 (others) kernels in
AlexNet. These settings constrain the feature
representation.
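
One way to compare the kernel sizes mentioned above is to count the input area that stacked small kernels cover. The sketch below assumes stride-1 convolutions with no pooling in between, which is a simplification not stated in the text, and the function names are hypothetical.

```python
def stacked_receptive_field(kernel_size, num_layers):
    """Receptive field of num_layers stride-1 convolutions applied in sequence."""
    return 1 + num_layers * (kernel_size - 1)


def weights_per_channel_pair(kernel_size, num_layers=1):
    """Weights linking one input channel to one output channel, biases ignored."""
    return num_layers * kernel_size ** 2


for layers in (1, 2, 3):
    print(layers,
          stacked_receptive_field(3, layers),
          weights_per_channel_pair(3, layers))

# Single large kernels for comparison: 5x5 and 7x7.
print(weights_per_channel_pair(5), weights_per_channel_pair(7))
```

Under these assumptions, two stacked 3×3 layers cover the same 5×5 area as one 5×5 kernel, and three cover 7×7, while using fewer weights per channel pair.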

The classification results of the concatenated vectors
are shown in Tables 1 and 2. Here, it can be seen that
concatenation of VGGNet layer-5,6,7 provides the highest
accuracy for both datasets: 99.38% on the Daimler dataset
and 92.00% on the Caltech 101 dataset. For AlexNet,
layer-3,4,5 and layer-3,5,7 achieved the top performance
rates on those datasets. The results show that combining
features of the convolutional and fully connected layers
provides better performance. It is especially noteworthy
that VGGNet layer 5, which is near the fully connected
layers, extracts highly effective features from an image
patch.
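
The best concatenation per architecture can be read off Tables 1 and 2 programmatically. The snippet below simply re-enters the reported rates; the dictionary layout and the best_setting helper are illustrative choices.

```python
# Accuracy (%) per concatenated layer set, copied from Tables 1 and 2.
daimler = {
    "VGGNet": {"3,4,5": 97.38, "4,5,6": 98.73, "5,6,7": 99.38, "3,5,7": 95.96},
    "AlexNet": {"3,4,5": 97.45, "4,5,6": 96.98, "5,6,7": 97.04, "3,5,7": 97.26},
}
caltech101 = {
    "VGGNet": {"3,4,5": 78.13, "4,5,6": 85.03, "5,6,7": 92.00, "3,5,7": 73.07},
    "AlexNet": {"3,4,5": 77.95, "4,5,6": 77.06, "5,6,7": 77.38, "3,5,7": 77.91},
}


def best_setting(table):
    """Return the highest-scoring layer set for each architecture."""
    return {arch: max(scores, key=scores.get) for arch, scores in table.items()}


print(best_setting(daimler))     # {'VGGNet': '5,6,7', 'AlexNet': '3,4,5'}
print(best_setting(caltech101))  # {'VGGNet': '5,6,7', 'AlexNet': '3,4,5'}
```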

5. Conclusion 

In this paper, we evaluated two different convolutional
neural network (CNN) architectures: AlexNet and VGGNet. The
features from layers 3-7 were evaluated on the Daimler
pedestrian benchmark and Caltech 101 datasets. We then
applied feature concatenation and PCA transformation. Our
experimental results show that the fully connected layers
did not always perform better for recognition tasks.
Additionally, the experiments using the Daimler and Caltech
101 datasets showed that layer 5 tends to provide the
highest level of accuracy, and that feature concatenation
of convolutional and fully connected layers improves
recognition performance.